
    GPU NTC Process Variation Compensation with Voltage Stacking

    Near-threshold computing (NTC) has the potential to significantly improve efficiency in high-throughput architectures such as general-purpose computing on graphics processing units (GPGPU). Nevertheless, NTC is more sensitive to process variation (PV) and complicates power delivery. We propose GPU stacking, a novel method based on voltage stacking, to manage the effects of PV and improve power delivery simultaneously. To evaluate our methodology, we first explore the design space of GPGPUs in the NTC regime to find a suitable baseline configuration, and then apply GPU stacking to mitigate the effects of PV. Compared with an equivalent NTC GPGPU without PV management, we achieve 37% higher performance on average. When considering high production volume, our approach shifts all chips closer to the nominal, variation-free case, delivering on average (across chips) ~80% of the performance of the nominal NTC GPGPU, whereas without our technique chips would reach only ~50% of nominal performance. We also show that our approach can be applied on top of multi-frequency-domain designs, further improving overall performance.
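    A toy model can make the intuition behind GPU stacking concrete. The sketch below is not the paper's methodology: it assumes a simple alpha-power-law current/delay model and invented voltage and variation constants, and only illustrates why putting two cores in series lets the mid-node voltage drift toward the slower (higher-Vth) core, pulling the worst-case frequency back toward nominal.

```python
# Toy Monte Carlo sketch of voltage stacking as PV compensation.
# All constants and the alpha-power-law model are illustrative
# assumptions, not values or methodology from the paper.
import random

ALPHA = 1.3        # velocity-saturation exponent (assumed)
V_NTC = 0.45       # per-core near-threshold supply, volts (assumed)
VTH_MEAN = 0.30    # nominal threshold voltage (assumed)
VTH_SIGMA = 0.03   # PV-induced sigma on Vth (assumed)

def freq(v, vth):
    """Alpha-power-law delay model: f ~ (V - Vth)^alpha / V."""
    return max(v - vth, 1e-6) ** ALPHA / v

def current(v, vth):
    """Saturation current under the same model: I ~ (V - Vth)^alpha."""
    return max(v - vth, 1e-6) ** ALPHA

def stacked_mid_node(vth_top, vth_bot, v_total):
    """Bisect for the mid-node voltage of a 2-high stack where the
    series currents match; the slower (high-Vth) layer ends up with
    a larger voltage share, partially cancelling its variation."""
    lo, hi = 0.01, v_total - 0.01
    for _ in range(60):
        v_bot = (lo + hi) / 2
        if current(v_bot, vth_bot) > current(v_total - v_bot, vth_top):
            hi = v_bot   # bottom conducts too much: give it less voltage
        else:
            lo = v_bot
    return v_bot

random.seed(1)
flat, stacked = [], []
for _ in range(10000):
    vth_a = random.gauss(VTH_MEAN, VTH_SIGMA)
    vth_b = random.gauss(VTH_MEAN, VTH_SIGMA)
    # Unstacked: both cores at V_NTC; the chip runs at the slower core.
    flat.append(min(freq(V_NTC, vth_a), freq(V_NTC, vth_b)))
    # Stacked: 2*V_NTC splits across the pair at the equal-current point.
    v_bot = stacked_mid_node(vth_a, vth_b, 2 * V_NTC)
    stacked.append(min(freq(2 * V_NTC - v_bot, vth_a), freq(v_bot, vth_b)))

nominal = freq(V_NTC, VTH_MEAN)
print(f"flat    mean: {sum(flat) / len(flat) / nominal:.2f} of nominal")
print(f"stacked mean: {sum(stacked) / len(stacked) / nominal:.2f} of nominal")
```

    Even this crude model shows the averaging effect: at the equal-current operating point, the stacked pair's worst-core overdrive depends on the mean of the two threshold voltages rather than on the worse one.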

    Optimizations for energy efficiency in GPGPU architectures

    It is commonplace for graphics processing units (GPUs) today to render extremely complex 3D scenes and textures in real time, in both the traditional and mobile computing spaces. The computational power required to do this makes them a valuable resource to exploit for general-purpose computation. Mapping programs originally designed for sequential CPUs onto massively parallel GPU architectures is only worthwhile if the transition brings substantial performance benefits. Over the last few years, there have been numerous proposals to improve the performance of GPUs used for general-purpose computing (GPGPUs), but without much consideration for energy efficiency. In my dissertation, I evaluate the feasibility of GPGPUs from an energy perspective and propose optimizations based on the unique programming model used by GPGPUs. First, I describe the simulation infrastructure, one of the few available today that models GPGPUs both individually and as part of a heterogeneous system. Next, I propose a design that uses a shared translation lookaside buffer (TLB) to eliminate recurring memory copies between the CPU and GPU address spaces, making heterogeneous CPU-GPU designs more energy efficient. Furthermore, to improve the energy efficiency of the on-chip memory hierarchy, I propose adding tiny incoherent caches per processing element, which filter out frequent accesses to large, shared, energy-inefficient cache structures. Finally, I evaluate a design that moves away from the underlying SIMD architecture of GPUs towards a more MIMD-like architecture, enabling the execution of both CPU and GPGPU workloads without hurting the energy efficiency of traditional GPGPU workloads.
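    Of the contributions listed above, the shared-TLB idea lends itself to a quick sketch. The toy below is a hypothetical illustration, not the dissertation's design: one translation table is visible to both a CPU and a GPU agent, so both resolve the same virtual pages to the same physical frames, and the staging copies of a split-address-space design (and their energy cost) simply never happen. The buffer size and the energy-per-byte constant are assumptions.

```python
# Minimal sketch of a shared CPU/GPU translation structure. Names,
# sizes, and the energy constant are illustrative assumptions.
PAGE = 4096
COPY_ENERGY_PER_BYTE = 10e-12  # assumed J/B for a staging copy

class SharedTLB:
    """One translation structure visible to both CPU and GPU agents."""
    def __init__(self):
        self.table = {}      # virtual page number -> physical frame
        self.next_frame = 0

    def translate(self, vaddr):
        vpage = vaddr // PAGE
        if vpage not in self.table:   # demand-map on first touch
            self.table[vpage] = self.next_frame
            self.next_frame += 1
        return self.table[vpage] * PAGE + vaddr % PAGE

tlb = SharedTLB()
buf_base, buf_bytes = 0x10000000, 64 * 1024 * 1024

# The CPU touches every page of a buffer; the GPU then resolves the
# same virtual pages to the same frames, so the data is already there.
cpu = [tlb.translate(buf_base + o) for o in range(0, buf_bytes, PAGE)]
gpu = [tlb.translate(buf_base + o) for o in range(0, buf_bytes, PAGE)]
assert cpu == gpu

# A split-address-space design would stage the buffer in and back out.
avoided = 2 * buf_bytes * COPY_ENERGY_PER_BYTE
print(f"staging-copy energy avoided: ~{avoided * 1e3:.1f} mJ per launch")
```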

    An Energy Efficient GPGPU Memory Hierarchy with Tiny Incoherent Caches

    With successive generations and the ever-increasing promise of computing power, GPGPUs have been growing quickly in size, and at the same time energy consumption has become a major bottleneck for them. The first-level data cache and the scratchpad memory are critical to the performance of a GPGPU, but they are extremely energy inefficient due to the large number of cores they must serve. This problem could be mitigated by introducing a cache higher up in the hierarchy that services fewer cores, but doing so introduces cache-coherency issues that may become very significant, especially for a GPGPU with hundreds of thousands of in-flight threads. In this paper, we propose adding incoherent tinyCaches between each lane in an SM and the first-level data cache that is currently shared by all the lanes in an SM. In a conventional multiprocessor, this would require hardware cache coherence across all the SM lanes, capable of handling hundreds of thousands of threads. Our incoherent tinyCache architecture exploits certain unique features of the CUDA/OpenCL programming model to avoid complex coherence schemes. The tinyCache filters out 62% of memory requests that would otherwise need to be serviced by the DL1G, and almost 81% of scratchpad memory requests, allowing us to achieve a 37% energy reduction in the on-chip memory hierarchy. We evaluate the tinyCache for different memory patterns and show that it is beneficial in most cases.
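    What makes the incoherence safe is the CUDA/OpenCL consistency model: another thread's writes only have to become visible at explicit synchronization points, so discarding the tinyCache contents at a barrier or kernel boundary can stand in for hardware coherence. The simulation below is an illustrative sketch with assumed sizes and a synthetic access stream, not the paper's evaluation; it only shows how even an 8-line per-lane cache can filter a sizeable fraction of requests away from the shared first-level data cache.

```python
# Sketch: a per-lane incoherent "tinyCache" that is flushed at
# synchronization points instead of being kept coherent. Cache size
# and the access stream are assumptions for illustration only.
import random

LINES, LINE_BYTES = 8, 32   # tiny direct-mapped cache (assumed size)

class TinyCache:
    def __init__(self):
        self.tags = [None] * LINES
        self.hits = self.accesses = 0

    def load(self, addr):
        self.accesses += 1
        line = addr // LINE_BYTES
        idx, tag = line % LINES, line // LINES
        if self.tags[idx] == tag:
            self.hits += 1        # filtered: never reaches the DL1
        else:
            self.tags[idx] = tag  # miss: fetched from the shared DL1

    def barrier(self):
        """__syncthreads()/kernel end: drop everything rather than
        snoop -- this is what makes incoherence safe."""
        self.tags = [None] * LINES

random.seed(0)
cache = TinyCache()
base = 0
for step in range(10000):
    if random.random() < 0.3:
        cache.load(random.randrange(1 << 20) * 4)  # scattered access
    else:
        cache.load(base + (step % 64) * 4)         # per-thread loop
    if step % 1000 == 999:
        cache.barrier()   # periodic synchronization point
        base += 4096      # new working set after the barrier

print(f"requests filtered from the DL1: {cache.hits / cache.accesses:.0%}")
```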